Abstract

Consider a collection of objects, some of which may be 'bad', and a test
which determines whether or not a given sub-collection contains no bad objects.
The non-adaptive pooling (or group testing) problem involves identifying
the bad objects using the least number of tests applied in parallel. The
'hypergeometric' case occurs when an upper bound on the number of bad objects is
known a priori. Here, practical considerations lead us to impose the additional
requirement of a posteriori confirmation that the bound is satisfied. A
generalization of the problem in which occasional errors in the test outcomes can
occur is also considered. Optimal solutions to the general problem are shown
to be equivalent to maximum-size collections of subsets of a finite set satisfying
a union condition which generalizes that considered by Erdős et al. Lower
bounds on the number of tests required are derived when the number of bad
objects is believed to be either 1 or 2. Steiner systems are shown to be optimal
solutions in some cases.



1 Introduction 



Each of n objects has an unknown binary status, 'good' or 'bad'. A test is 
available which, except for occasional failures or errors, establishes whether or 
not all the objects in a given collection are good. The problem is to resolve the 
status of each object using the minimum number of tests applied in parallel. 
The corresponding adaptive problem, in which the choice of test at any stage 
can depend on the outcomes of previous tests, is sometimes known as 'group 
testing' (Wolf).

The objects may, for example, be electronic devices which can be tested 
in series. Another example involves items in a database which are categorized 
by a sequence of binary classifications and the task is to partition the objects 
according to the ith classification. The problem is formally similar to that of 
devising optimal error-correcting codes using parity checks, except that here the 
test result is 'at least one bad object' rather than 'an odd number of bad objects'. 
Our work is motivated by an optimal design problem for large-scale experiments 
aimed at constructing physical maps of human chromosomes: the objects are 
chromosome fragments which are 'bad' if they contain a certain DNA sequence. 
An experimental test known as the Polymerase Chain Reaction can determine 
whether or not a collection of chromosome fragments are all good. In order to 
facilitate automation, it is desirable that the experiments be applied in parallel 
so that the experimental design is non-adaptive, or one-stage. Here, we derive 
experimental designs which, with high probability, are one-stage solutions to 
an appropriate formalization of the problem. These designs may form stages in 
solutions to more general problems, for example adaptive (multi-stage) designs 
which are optimal subject to a cost function which penalizes additional stages. 

A pool is a set of objects and a design is a set of pools. Given a design $\mathcal{D}$, let
$v$ denote the number of pools, so that $v = |\mathcal{D}|$. We will say that a pool is good if
all the objects in it are good, otherwise it is bad. Let $P$ denote the total number
of bad objects. The test usually distinguishes good pools from bad, but we will
also allow the possibility that for some pools the test fails to produce a result
and write $Q$ for the number of pools in $\mathcal{D}$ which fail. Before applying the tests
$P$ and $Q$ are unknown, but we may have some prior information about them.
One simple design consists of testing each object individually a fixed number of
times. However if both $P \ll n$ and $Q \ll v$ then 'better' designs are possible.

There are several reasonable optimality criteria for $\mathcal{D}$. An appropriate choice
will depend in part on the prior knowledge of $P$ and $Q$. Bush et al. and
Hwang & Sós discuss non-adaptive group testing in the 'hypergeometric'
case, in which $Q = 0$ and $P$ is bounded above by a known constant $p$. They
define $\mathcal{D}$ to be an optimal solution if it maximizes $n$ for fixed $v$ among designs
such that the status of each object can be inferred from the pool outcomes. The
hypergeometric formulation has the drawback that it assumes that the event
$P > p$ is excluded a priori. It is not in general possible to confirm a posteriori
that $P \le p$ and hence false conclusions may be drawn if, unexpectedly, $P > p$.
In practice, a large value of $p$ must be chosen to exclude this possibility. Here,
we modify the hypergeometric case by imposing the additional requirement
that the event $P > p$ can be distinguished a posteriori. Consequently, it will be
reasonable in practice to allow a small prior probability that $P > p$. Typically,
lower values of $p$ can be chosen than under the hypergeometric formulation and
hence more efficient designs constructed. The price for these advantages is that
the designs are not strictly non-adaptive: with small probability a second stage
will be required.

Allowing also for up to $q$ failures, we define $\mathcal{D}$ to be an optimal solution if
it maximizes $n$ for fixed $v$ subject to the requirement that whenever $Q \le q$ we
can infer from the pool outcomes either the status of each object or that $P > p$.
Proposition 1 of section 2 establishes that optimal solutions $\mathcal{D}$ are equivalent
to maximum-size collections of subsets of a $v$-set such that every subset in the
collection has more than $q$ elements distinct from any union of up to $p$ others.
This condition is equivalent to that of $q$-error detection and hence optimal
$q$-failure designs are also optimal $q$-error-detecting designs. In the $q = 0$ case, we
require that no subset in the collection is contained in the union of $p$ others.
Hwang & Sós showed that this requirement characterizes the $p$-complete
designs defined by Bush et al.

In Theorems 1 and 2 we establish lower bounds on $v$ as a function of $n$ for
$p = 1$ and $2$ and all $q \ge 0$. The bounds coincide in some cases with the sizes of
certain Steiner system solutions which hence are optimal. These results extend
the results of Erdős et al. who considered the case $p = 2$ and $q = 0$. These
authors initially constrained the designs to be uniform, that is, each object occurs
in the same number of pools. They subsequently derived an asymptotic bound
in the unconstrained case. Here, we do not require uniformity but we note that
the bounds given in Theorems 1 and 2 can only be achieved by uniform designs.
Ruszinkó derives asymptotic bounds for $q = 0$ and arbitrary $p$, but in the
case $p = 2$ the bound obtained by Erdős et al. is tighter.

2 Definitions and statement of results 

For positive integers $i \le j$, let $X_j$ denote the set of subsets of $\{1, 2, \ldots, j\}$
and define

\[ X_j^i = \{ C \in X_j : |C| = i \}. \tag{1} \]

Pools are elements of $X_n$ and designs are subsets of $X_n$. Given a design $\mathcal{D} =
\{A_1, \ldots, A_v\}$ we will write $\mathcal{D}^* = \{B_1, \ldots, B_n\}$ for the dual of $\mathcal{D}$ defined by
$i \in A_j$ if and only if $j \in B_i$. Thus $A_j$ indexes the objects in the $j$th pool
whereas $B_i$ indexes the pools which contain the $i$th object. Let $\phi(A)$ denote the
set of indices of bad pools in $\mathcal{D}$ when the objects indexed by $A$ are bad and no
failures occur, that is

\[ \phi(A) = \bigcup_{i \in A} B_i. \tag{2} \]
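As a concrete illustration of the dual and of (2), the following Python sketch builds the sets $B_i$ from a list of pools and evaluates $\phi(A)$. The helper names `dual` and `phi` and the toy design are assumptions made for this example only; objects and pools are numbered from 1, as in the text.

```python
def dual(pools, n):
    """Return [B_1, ..., B_n]: B_i is the set of indices j of pools A_j containing object i."""
    return [frozenset(j for j, A in enumerate(pools, start=1) if i in A)
            for i in range(1, n + 1)]


def phi(B, bad):
    """Indices of the bad pools when exactly the objects in `bad` are bad (no failures)."""
    out = set()
    for i in bad:
        out |= B[i - 1]  # B is a 0-based list holding B_1, ..., B_n
    return frozenset(out)


# Hypothetical toy design: n = 4 objects, v = 3 pools.
pools = [{1, 2}, {2, 3}, {3, 4}]
B = dual(pools, 4)
print(B)
print(phi(B, {2}), phi(B, {1, 3}))
```

In this toy design $\phi(\{2\}) = \phi(\{1, 2\})$, so the outcomes cannot separate those two possibilities; this is the kind of ambiguity that condition (3) excludes.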

We say that $\mathcal{D}$ is a $p$-bad, 0-failure solution, or $(p, 0)$-solution, if from $\phi(A)$ we
can infer either $A$ or that $|A| > p$, assuming that no failures occur. This occurs
if and only if

\[ \phi(A) \ne \phi(A') \quad \text{for all } A, A' \in X_n \text{ such that } A \ne A' \text{ and } |A| \le p. \tag{3} \]

Note that in the hypergeometric case (Hwang & Sós), $\phi(A) \ne \phi(A')$ is
required only when both $|A| \le p$ and $|A'| \le p$.

We define $\mathcal{D}$ to be a $p$-bad, $q$-failure solution, or $(p, q)$-solution, if from $\phi(A)$
we can infer either $A$ or that $|A| > p$, even in the presence of up to $q$ failures.
This occurs if and only if each $(v - q)$-subset of $\mathcal{D}$ is a $(p, 0)$-solution. We write
$\sigma^v_{p,q}$ for the set of duals of $(p, q)$-solutions and say that $\mathcal{D}$ is optimal if $\mathcal{D}^*$ has
maximum cardinality in $\sigma^v_{p,q}$. From (3) it follows that $\mathcal{D}$ is a $(p, q)$-solution if
and only if

\[ |\phi(A) \,\Delta\, \phi(A')| > q \quad \text{for all } A, A' \in X_n \text{ such that } A \ne A' \text{ and } |A| \le p, \tag{4} \]

in which $B \Delta C = (B \setminus C) \cup (C \setminus B)$. Note that (4) can be regarded as the
definition of a solution in the case that test failures do not occur but up to $q$ wrong
outcomes may be recorded and the detection of any such error is required. Hence
optimal $p$-bad, $q$-failure solutions are also optimal $p$-bad, $q$-error-detecting
solutions.

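Proposition 1 A design $\mathcal{D}$ is a $(p, q)$-solution, that is $\mathcal{D}^* \in \sigma^v_{p,q}$, if and only if

\[ |B_i \setminus \phi(A)| > q \quad \text{for every } A \in X_n \text{ with } |A| \le p \text{ and all } i \in \{1, 2, \ldots, n\} \setminus A. \tag{5} \]

Corollary 1 A design $\mathcal{D}$ satisfies $\mathcal{D}^* \in \sigma^v_{1,q}$ if and only if $|B \setminus B'| > q$ for all
distinct $B, B' \in \mathcal{D}^*$.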

Proof By considering the case that $|A| \le p$ and $A' = A \cup \{i\}$ for some $i \notin A$,
we see that (5) is necessary for (4). Suppose that $A, A' \in X_n$ with $A \ne A'$
and $|A| \le p$. If $A' \setminus A \ne \emptyset$ then it follows from (5) that $|\phi(A') \setminus \phi(A)| > q$.
Alternatively, if $A' \setminus A = \emptyset$ then both $|A'| < p$ and $A \setminus A' \ne \emptyset$ and hence (5)
implies that $|\phi(A) \setminus \phi(A')| > q$. In either case we have $|\phi(A) \,\Delta\, \phi(A')| > q$ and
hence (5) is sufficient for (4).
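The equivalence of conditions (4) and (5) can be checked by brute force on small examples. The sketch below is illustrative only: the function names and the two toy families are assumptions, objects are indexed from 0, and the exhaustive search over all pairs in (4) is practical only for a handful of objects.

```python
from itertools import combinations


def union_of(B, idx):
    """phi(A): union of the sets B[i] over the object indices i in idx."""
    out = set()
    for i in idx:
        out |= B[i]
    return out


def satisfies_condition_5(B, p, q):
    """Condition (5): B_i has more than q elements outside phi(A) if |A| <= p and i not in A."""
    n = len(B)
    for size in range(min(p, n) + 1):
        for A in combinations(range(n), size):
            covered = union_of(B, A)
            if any(len(B[i] - covered) <= q for i in range(n) if i not in A):
                return False
    return True


def satisfies_condition_4(B, p, q):
    """Condition (4): phi(A) and phi(A') differ in more than q pools if A != A' and |A| <= p."""
    n = len(B)
    all_subsets = [c for r in range(n + 1) for c in combinations(range(n), r)]
    for A in all_subsets:
        if len(A) > p:
            continue
        for A2 in all_subsets:
            if A2 != A and len(union_of(B, A) ^ union_of(B, A2)) <= q:
                return False
    return True


# Hypothetical examples: each object tested alone twice, and a nested family.
repeated = [{0, 1}, {2, 3}, {4, 5}]
nested = [{0}, {0, 1}, {1}]
print(satisfies_condition_5(repeated, 2, 1), satisfies_condition_5(nested, 1, 0))
```

Testing each object individually $q+1$ times passes for every $p$, which matches the 'simple design' mentioned in the introduction; the nested family fails already for $p = 1$, $q = 0$ because one set is contained in another.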
Let $0 \le t < k \le v$. A $(t, k, v)$-packing is a set $\mathcal{D}^* \subseteq X_v^k$ such that the
intersection of any two elements of $\mathcal{D}^*$ has cardinality at most $t$. A direct
corollary of Proposition 1 is that if $\mathcal{D}^*$ is a $(t, pt+q+1, v)$-packing then $\mathcal{D}^* \in \sigma^v_{p,q}$.

If $|\mathcal{D}^*| = \binom{v}{t+1} / \binom{k}{t+1}$ then each element of $X_v^{t+1}$ is contained in precisely one
element of $\mathcal{D}^*$ and $\mathcal{D}^*$ is also called a Steiner system, denoted $S(t+1, k, v)$. For
further details including a list of small Steiner systems known to exist, we refer
to Beth et al.
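A small worked case may make the packing corollary concrete. The Fano plane is the Steiner system $S(2, 3, 7)$, that is, a $(1, 3, 7)$-packing of maximum size. Since $3 = pt + q + 1$ for $(p, q) = (2, 0)$ and for $(p, q) = (1, 1)$ with $t = 1$, it should satisfy the union condition of Proposition 1 in both cases. The Python sketch below verifies this directly; the function names are assumptions made for the example.

```python
from itertools import combinations
from math import comb

# The seven lines of the Fano plane on the points 1..7.
fano = [frozenset(s) for s in
        ({1, 2, 3}, {1, 4, 5}, {1, 6, 7}, {2, 4, 6}, {2, 5, 7}, {3, 4, 7}, {3, 5, 6})]


def is_packing(blocks, t, k):
    """True if every block has size k and any two blocks share at most t points."""
    return (all(len(b) == k for b in blocks)
            and all(len(a & b) <= t for a, b in combinations(blocks, 2)))


def union_condition(blocks, p, q):
    """True if each block has more than q points outside any union of up to p other blocks."""
    m = len(blocks)
    for r in range(min(p, m) + 1):
        for others in combinations(range(m), r):
            covered = set()
            for j in others:
                covered |= blocks[j]
            if any(len(blocks[i] - covered) <= q for i in range(m) if i not in others):
                return False
    return True


steiner_size = comb(7, 2) // comb(3, 2)  # number of blocks in S(2, 3, 7)
print(is_packing(fano, 1, 3), len(fano) == steiner_size)
print(union_condition(fano, 2, 0), union_condition(fano, 1, 1), union_condition(fano, 2, 1))
```

The last check fails, as expected: with $p = 2$ and $q = 1$ the packing corollary would need blocks of size $pt + q + 1 = 4$.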



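Theorem 1 If a design $\mathcal{D}$ satisfies $\mathcal{D}^* \in \sigma^v_{1,q}$ then $n = |\mathcal{D}^*|$ satisfies

\[ n \le \frac{1}{K_q} \binom{v}{\lfloor v/2 \rfloor}, \tag{6} \]

in which $K_0 = 1$ and, for $q$ even,

\[ K_q = \sum_{s=0}^{q/2} \binom{\lfloor v/2 \rfloor}{s} \binom{\lceil v/2 \rceil}{s}, \tag{7} \]

while for $q$ odd,

\[ K_q = K_{q-1} + \frac{1}{T} \binom{\lfloor v/2 \rfloor}{(q+1)/2} \binom{\lceil v/2 \rceil}{(q+1)/2}, \tag{8} \]

where $T = \lfloor 2 \lfloor v/2 \rfloor / (q+1) \rfloor$.

Corollary 2 (Sperner, 1928) The set $X_v^{\lfloor v/2 \rfloor}$ is optimal in $\sigma^v_{1,0}$.

Corollary 3 If $S(\lfloor v/2 \rfloor - 1, \lfloor v/2 \rfloor, v)$ exists then it is optimal in $\sigma^v_{1,1}$.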
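The bound of Theorem 1 is easy to evaluate numerically. The Python sketch below assumes that $K_q$ is the sum over $s \le q/2$ of $\binom{\lfloor v/2 \rfloor}{s}\binom{\lceil v/2 \rceil}{s}$ for even $q$, with the extra term $\frac{1}{T}\binom{\lfloor v/2 \rfloor}{(q+1)/2}\binom{\lceil v/2 \rceil}{(q+1)/2}$ added to $K_{q-1}$ for odd $q$, as the proof in section 3 indicates. The function names are hypothetical, and exact rational arithmetic is used to avoid rounding.

```python
from fractions import Fraction
from math import comb


def K(q, v):
    """K_q as read from (7) and (8); the exact form is an assumption of this sketch."""
    lo, hi = v // 2, v - v // 2            # floor(v/2) and ceil(v/2)
    even = q if q % 2 == 0 else q - 1      # largest even number not exceeding q
    total = Fraction(sum(comb(lo, s) * comb(hi, s) for s in range(even // 2 + 1)))
    if q % 2 == 1:
        T = (2 * lo) // (q + 1)
        if T < 1:
            raise ValueError("T must be at least 1; v is too small for this q")
        half = (q + 1) // 2
        total += Fraction(comb(lo, half) * comb(hi, half), T)
    return total


def theorem1_bound(v, q):
    """Upper bound (6) on n for p = 1, returned as an exact Fraction."""
    return comb(v, v // 2) / K(q, v)


print(theorem1_bound(7, 0), theorem1_bound(7, 1), theorem1_bound(6, 2))
```

For $v = 7$ and $q = 1$ the bound equals 7, the number of blocks of the Steiner system $S(2, 3, 7)$, in agreement with Corollary 3.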
Theorem 2 If a design $\mathcal{D}$ satisfies $\mathcal{D}^* \in \sigma^v_{2,q}$ then

\[ n \le \binom{v}{t^*} \Big/ \binom{2t^*+q-1}{t^*}, \tag{9} \]

in which $t^*$ is the least integer value of $t$ such that

\[ v \le 5t + 2 + \frac{q(q-1)}{t+q}, \tag{10} \]

so that $t^* = \lceil (v-2)/5 \rceil$ if $q = 0$ or $1$.

Corollary 4 If $S(t^*, 2t^*+q-1, v)$ exists then it is optimal in $\sigma^v_{2,q}$.
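The bound of Theorem 2 can be evaluated in the same way. The sketch below makes two assumptions: inequality (10) is read as non-strict, which is the reading consistent with the closed form $t^* = \lceil (v-2)/5 \rceil$ for $q = 0$ or $1$, and the search for $t^*$ starts at $t = 1$ (so $v \ge 3$ is assumed). The function names are hypothetical.

```python
from fractions import Fraction
from math import comb


def t_star(v, q):
    """Least integer t >= 1 with v <= 5t + 2 + q(q-1)/(t+q); assumes v >= 3."""
    t = 1
    while v > 5 * t + 2 + Fraction(q * (q - 1), t + q):
        t += 1
    return t


def theorem2_bound(v, q):
    """Upper bound (9) on n for p = 2, read as C(v, t*) / C(2t* + q - 1, t*)."""
    t = t_star(v, q)
    return Fraction(comb(v, t), comb(2 * t + q - 1, t))


print(t_star(9, 0), theorem2_bound(9, 0))
print(t_star(22, 1), theorem2_bound(22, 1))
```

For $v = 9$ and $q = 0$ this gives $t^* = 2$ and the bound 12, which is the number of blocks of a Steiner system $S(2, 3, 9)$, as Corollary 4 suggests.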
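In the $p = 1$ case, Stirling's formula and (6) give

\[ n \le \lfloor q/2 \rfloor! \, \lceil q/2 \rceil! \, \frac{2^{v+q+1/2}}{\sqrt{\pi}\, v^{q+1/2}} \, (1+o(1)), \tag{11} \]

and hence asymptotically

\[ v \ge \frac{\log n}{\log 2} (1+o(1)). \tag{12} \]

For $p = 2$, Stirling's formula and (9) give

[equation (13) is illegible in the source]

and hence

\[ v \ge \frac{\log n}{\log(5/4)} (1+o(1)). \tag{14} \]

The asymptotic bound (14) was obtained by Erdős et al.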



3 Proof of Theorem 1 

By Corollary 1, if $q = 0$ then each $\mathcal{D}^* \in \sigma^v_{1,0}$ satisfies the requirement of no
pairwise containment and the theorem was proved in this case by Sperner.
A simple proof of Sperner's result is given by Lubell. Here, and in the
sequel, 'chain' will always mean a maximal chain of $X_v$, ordered by inclusion.
Each chain contains at most one element of $\mathcal{D}^*$ and hence we can associate with
each $B \in \mathcal{D}^*$ a 'cost' which is the proportion of all chains which contain $B$ and
hence no further element of $\mathcal{D}^*$. For any $k$, the set of chains can be partitioned by
the $k$-sets into equal parts. Therefore the cost of $B$ is $1/\binom{v}{k}$, where $k = |B|$,
which is minimized at $k = \lfloor v/2 \rfloor$. Since $X_v^{\lfloor v/2 \rfloor}$ consists only of minimal cost
elements but achieves the maximum total cost of one, it is optimal in $\sigma^v_{1,0}$.
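The chain-counting step can be confirmed by enumeration for small $v$. In the sketch below each maximal chain is generated from a permutation of $\{1, \ldots, v\}$, and the 'cost' of a set is computed as the fraction of chains passing through it. The function names are assumptions, and the enumeration over $v!$ chains is feasible only for very small $v$.

```python
from fractions import Fraction
from itertools import combinations, permutations
from math import comb


def chains(v):
    """Yield every maximal chain of subsets of {1..v} as a list of v+1 frozensets."""
    for perm in permutations(range(1, v + 1)):
        yield [frozenset(perm[:i]) for i in range(v + 1)]


def cost(B, v):
    """Proportion of maximal chains of X_v that contain the set B."""
    all_chains = list(chains(v))
    hits = sum(1 for chain in all_chains if B in chain)
    return Fraction(hits, len(all_chains))


v = 4
print(cost(frozenset({1, 3}), v), Fraction(1, comb(v, 2)))
print(sum(cost(frozenset(c), v) for c in combinations(range(1, v + 1), 2)))
```

The family of all 2-subsets of a 4-set has total cost exactly one, which is why no antichain can be larger.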

This argument can be extended to the $p = 1$, $q > 0$ case. Suppose first that
$q$ is even and consider $B \in \mathcal{D}^*$ with $|B| = k$. Define the $s$-neighbours of $B$ to be
the sets $C \in X_v$ such that $|B \setminus C| = |C \setminus B| = s$. We will say that $B$ 'blocks' the
chains which contain one of its $s$-neighbours for some $s \le q/2$. Let $B'$ denote an
element of $\mathcal{D}^*$ distinct from $B$. If a chain contains both an $s$-neighbour of $B$ and
an $s'$-neighbour of $B'$ then either $|B \setminus B'| \le s+s'$ or $|B' \setminus B| \le s+s'$. It follows
from Corollary 1 that we cannot have both $s \le q/2$ and $s' \le q/2$. Therefore a
chain cannot be blocked by more than one element of $\mathcal{D}^*$. Each chain blocked by
$B$ can contain no element of $\mathcal{D}^*$ other than $B$ and hence we can associate with
$B$ the cost $h(B)$ which is the proportion of chains blocked by $B$. The value of
$h(B)$ is

\[ h(B) = K_{q,k} \Big/ \binom{v}{k}, \tag{15} \]

where $K_{q,k}$ denotes the total number of $s$-neighbours of $B$ with $s \le q/2$, which
is given by

\[ K_{q,k} = \sum_{s=0}^{q/2} \binom{k}{s} \binom{v-k}{s}. \tag{16} \]

It is readily verified that $h(B)$ is minimized when $k = \lfloor v/2 \rfloor$ or $\lceil v/2 \rceil$ and,
in either case, the value of $K_{q,k}$ is $K_q$, defined at (7). Since chains cannot
be multiply blocked, the sum of the costs of the elements of $\mathcal{D}^*$ cannot exceed
one. Therefore $n$ is bounded above by the inverse of the minimal cost which
establishes (6) in the case $q$ even.
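The count of $s$-neighbours used in the even case can be checked by enumeration. The sketch below lists the $s$-neighbours of a $k$-set directly and compares their number with the product $\binom{k}{s}\binom{v-k}{s}$, which is the form assumed here for the summand of $K_{q,k}$ (choose $s$ elements of $B$ to drop and $s$ elements outside $B$ to add). The function names are hypothetical.

```python
from itertools import combinations
from math import comb


def neighbours(B, v, s):
    """All s-neighbours of B in {1..v}: sets C with |B - C| = |C - B| = s."""
    B = frozenset(B)
    # Any C with |C| = |B| automatically has |C - B| = |B - C|.
    return [frozenset(C) for C in combinations(range(1, v + 1), len(B))
            if len(B - frozenset(C)) == s]


def K_formula(q, k, v):
    """Number of s-neighbours of a k-set with s <= q/2, for even q."""
    return sum(comb(k, s) * comb(v - k, s) for s in range(q // 2 + 1))


def K_brute(q, B, v):
    """The same count obtained by listing the neighbours."""
    return sum(len(neighbours(B, v, s)) for s in range(q // 2 + 1))


B, v = {1, 2, 3}, 6
print([(q, K_brute(q, B, v), K_formula(q, len(B), v)) for q in (0, 2, 4)])
```

With $v = 6$, $k = 3$ and $q = 2$ the count is 10, so each element of the design blocks half of the $\binom{6}{3} = 20$ middle-level sets and at most two elements fit.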

Now suppose that $q$ is odd. Each chain which contains an $s$-neighbour of
$B$, for some $s \le (q+1)/2$, can contain no element of $\mathcal{D}^*$ other than $B$ and we
say that these chains are blocked by $B$. If $B'$ is an element of $\mathcal{D}^*$ distinct from
$B$ then no chain can contain both an $s$-neighbour of $B$ and an $s'$-neighbour of $B'$
when $s+s' \le q$. However, if either $|B \setminus B'| = q+1$ or $|B' \setminus B| = q+1$ then there
exist chains which contain both a $((q+1)/2)$-neighbour of $B$ and a
$((q+1)/2)$-neighbour of $B'$ and hence such chains are multiply blocked.

Let $\mathcal{C}$ denote the set of elements of $\mathcal{D}^*$ which have a $((q+1)/2)$-neighbour in
the chain $\{\emptyset, \{1\}, \{1, 2\}, \ldots, \{1, 2, \ldots, v\}\}$. If $B \in \mathcal{C}$ with $|B| = k$ then

\[ |B \cap \{k+1, k+2, \ldots, v\}| = \frac{q+1}{2} = |\{1, 2, \ldots, k\} \setminus B|. \tag{17} \]

Further, if $B' \in \mathcal{C}$ and $B' \ne B$ then, since both $|B \setminus B'| > q$ and $|B' \setminus B| > q$ must
be satisfied, $B'$ contains $\{1, 2, \ldots, k\} \setminus B$ and is disjoint from $\{k+1, k+2, \ldots, v\} \cap B$.
Hence $|\mathcal{C}|$ cannot exceed $\min\{\lfloor 2k/(q+1) \rfloor, \lfloor 2(v-k)/(q+1) \rfloor\}$, which takes
maximum value $T$ when $k = \lfloor v/2 \rfloor$. Therefore the number of elements of $\mathcal{D}^*$
which have a $((q+1)/2)$-neighbour in a given chain is at most $T$.
Associate with $B$ a cost

\[ h'(B) = K'_{q,k} \Big/ \binom{v}{k}, \tag{18} \]

where $K'_{q,k}$ is $K_{q-1,k}$ plus $1/T$ times the number of $((q+1)/2)$-neighbours of $B$.
This cost is minimized at $k = \lfloor v/2 \rfloor$ or $\lceil v/2 \rceil$ and in either case $K'_{q,k}$ is $K_q$
defined at (8). The sum of the costs of the elements of $\mathcal{D}^*$ cannot exceed one
and hence the theorem.

4 Proof of Theorem 2



Definition 1 For any $\mathcal{D}^* \in \sigma^v_{2,q}$, we follow Erdős et al. and say that $b \in X_v$
is private in $\mathcal{D}^*$ if there exists a unique $B \in \mathcal{D}^*$ such that $b \subseteq B$.

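Definition 2 If $B \in X_v$ with $|B| > q$ then $\mathcal{F} \subseteq X_v$ is a $(2, q)$-cover of $B$
precisely if both

1. if $b \in \mathcal{F}$ and $b \subseteq b' \subseteq B$ then $b' \in \mathcal{F}$; and

2. for every $b \subseteq B$ with $|B \setminus b| \le q$ at least one part of every two-partition of
$b$ is in $\mathcal{F}$.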



Lemma 1 A design $\mathcal{D}$ satisfies $\mathcal{D}^* \in \sigma^v_{2,q}$ if and only if for each $B \in \mathcal{D}^*$, the
subsets of $B$ which are private in $\mathcal{D}^*$ form a $(2, q)$-cover of $B$.

Proof Let $b \subseteq B$ with $|B \setminus b| \le q$. It follows from Proposition 1 that $|B| > q$
and $b$ is private in $\mathcal{D}^*$. If there exists a partition of $b$ into two non-private
parts then there must be $C$ and $C'$ in $\mathcal{D}^* \setminus \{B\}$ such that $b \subseteq (C \cup C')$ and
hence $|B \setminus (C \cup C')| \le q$, which contradicts Proposition 1. Therefore a necessary
condition for $\mathcal{D}^* \in \sigma^v_{2,q}$ is that for each $B \in \mathcal{D}^*$ the private subsets of $B$ in $\mathcal{D}^*$
form a $(2, q)$-cover. Sufficiency is immediate from Proposition 1.
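Lemma 1 can be explored on small families with a direct implementation of Definitions 1 and 2. The Python sketch below is only illustrative: the function names are assumptions, and a 'two-partition' of $b$ is taken to include the trivial split into $\emptyset$ and $b$, which is an interpretation rather than something stated in the text. The Fano plane (a member of $\sigma^7_{2,0}$ by the packing corollary) passes, while three pairwise-overlapping pairs fail.

```python
from itertools import combinations


def subsets(S):
    """All subsets of S as frozensets."""
    S = sorted(S)
    return [frozenset(c) for r in range(len(S) + 1) for c in combinations(S, r)]


def private_subsets(blocks, B):
    """Subsets of B contained in exactly one block of the family (namely B itself)."""
    return {b for b in subsets(B) if sum(1 for C in blocks if b <= C) == 1}


def is_cover(B, F, q):
    """Check the two conditions of Definition 2 for F to be a (2, q)-cover of B."""
    B = frozenset(B)
    if len(B) <= q:
        return False
    # Condition 1: F is closed under adding elements of B (one at a time suffices).
    for b in F:
        if b <= B and any((b | {x}) not in F for x in B - b):
            return False
    # Condition 2: every split of a large subset b of B has a part in F.
    for b in subsets(B):
        if len(B - b) <= q:
            if any(part not in F and (b - part) not in F for part in subsets(b)):
                return False
    return True


fano = [frozenset(s) for s in
        ({1, 2, 3}, {1, 4, 5}, {1, 6, 7}, {2, 4, 6}, {2, 5, 7}, {3, 4, 7}, {3, 5, 6})]
triangle = [frozenset(s) for s in ({1, 2}, {2, 3}, {1, 3})]
print(all(is_cover(B, private_subsets(fano, B), 0) for B in fano))
print(any(is_cover(B, private_subsets(triangle, B), 0) for B in triangle))
```

In the Fano plane every pair of points lies on exactly one line, so the private subsets of a line are its three pairs and the line itself.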



Definition 3 For $\mathcal{F}$ a $(2, q)$-cover of $B \in X_v$, define $h(B, \mathcal{F})$ to be the
proportion of all chains which intersect $\mathcal{F}$.



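Proposition 2 For any $B \in X_v$ with $|B| = k$ such that $q < k < v-1$ and $\mathcal{F}$ a
$(2, q)$-cover of $B$,

\[ h(B, \mathcal{F}) \ge \binom{2t+q-1}{t} \Big/ \binom{v}{t}, \tag{19} \]

in which $t = \lfloor (k-q+1)/2 \rfloor$. Equality is achieved in (19) if and only if $k-q$ is
odd and $\mathcal{F} = \mathcal{F}^*$ where $\mathcal{F}^* = \{ b \subseteq B : |b| \ge t \}$.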

Proof If $k = 2t+q-1$ then every two-partition of any $(k-q)$-subset of $B$
contains a part $b$ such that $|b| \ge t$ and hence $\mathcal{F}^*$ is a $(2, q)$-cover of $B$. Further,
$h(B, \mathcal{F}^*)$ is precisely the proportion of chains which contain a $t$-subset of $B$ and
hence $h(B, \mathcal{F}^*)$ achieves equality in (19).

Suppose that $k = 2t+q-1$ and let $s$ denote the largest integer such that there
exists some $b \subseteq B$ with $|b| = t+s$ and $b \notin \mathcal{F}$. It follows from Definition 2 that
$s < t-1$. If $s < 0$ then either $\mathcal{F} = \mathcal{F}^*$ or $\mathcal{F}^* \subset \mathcal{F}$ and $h(B, \mathcal{F}^*) < h(B, \mathcal{F})$
and hence we may assume that $s \ge 0$. If there exists some $b \in \mathcal{F}$ such that
$|b| < t-s-1$ then $\mathcal{F} \setminus \{b\}$ is also a $(2, q)$-cover of $B$ and $h(B, \mathcal{F} \setminus \{b\}) \le h(B, \mathcal{F})$.
Hence we may assume that there is no such $b$. From Definition 2, if $b$ is a
$(k-q)$-subset of $B$ then the number of $(t+s)$-subsets of $b$ not in $\mathcal{F}$ is not greater
than the number of $(t-s-1)$-subsets of $b$ in $\mathcal{F}$. Summing over all such $b$, each
$r$-set occurs in $\binom{k-r}{q}$ terms of the sum and hence

\[ \binom{k-t-s}{q} \bar{f}_{t+s} \le \binom{k-t+s+1}{q} f_{t-s-1}, \tag{20} \]

in which $\bar{f}_r$ denotes the number of $r$-subsets of $B$ not in $\mathcal{F}$ while $f_r$ denotes the
number of such subsets in $\mathcal{F}$. Inequality (20) is equivalent to

\[ \bar{f}_{t+s} \le f_{t-s-1} \, \frac{(k-t+s+1)!\,(t-s-1)!}{(k-t-s)!\,(t+s)!}. \tag{21} \]
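The passage from (20) to (21) is a matter of expanding binomial coefficients. The following sketch assumes, as in the surrounding argument, that $k - q = 2t - 1$, that each $r$-subset of $B$ lies in $\binom{k-r}{q}$ of the $(k-q)$-subsets of $B$, and that $\bar{f}_r$ and $f_r$ count the $r$-subsets of $B$ outside and inside the cover respectively.

```latex
\begin{align*}
\binom{k-t-s}{q}   &= \frac{(k-t-s)!}{q!\,(k-q-t-s)!}   = \frac{(k-t-s)!}{q!\,(t-s-1)!}, \\
\binom{k-t+s+1}{q} &= \frac{(k-t+s+1)!}{q!\,(k-q-t+s+1)!} = \frac{(k-t+s+1)!}{q!\,(t+s)!}.
\end{align*}
% Substituting into (20) and cancelling the common factor 1/q!:
\[
\bar{f}_{t+s}\,\frac{(k-t-s)!}{(t-s-1)!} \le f_{t-s-1}\,\frac{(k-t+s+1)!}{(t+s)!}
\quad\Longleftrightarrow\quad
\bar{f}_{t+s} \le f_{t-s-1}\,\frac{(k-t+s+1)!\,(t-s-1)!}{(k-t-s)!\,(t+s)!}.
\]
```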

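Construct $\mathcal{F}'$ from $\mathcal{F}$ by removing all $(t-s-1)$-sets and adding any missing
$(t+s)$-subsets of $B$, so that

\[ \mathcal{F}' = \{ b \subseteq B : |b| = t+s \} \cup \big( \mathcal{F} \setminus \{ b \subseteq B : |b| = t-s-1 \} \big). \tag{22} \]

Now, $\mathcal{F}'$ is also a $(2, q)$-cover of $B$ and $h(B, \mathcal{F}') - h(B, \mathcal{F})$ is precisely the
proportion of chains which contain a $(t+s)$-subset of $B$ but no element of $\mathcal{F}$
minus the proportion of chains which contain a $(t-s-1)$-set in $\mathcal{F}$ but no other
element of $\mathcal{F}$. Therefore

\[ h(B, \mathcal{F}') - h(B, \mathcal{F}) = (v-k) \left[ \frac{\bar{f}_{t+s}}{(v-t-s)\binom{v}{t+s}} - \frac{f_{t-s-1}}{(v-t+s+1)\binom{v}{t-s-1}} \right], \tag{23} \]

and hence $h(B, \mathcal{F}') \ge h(B, \mathcal{F})$ if and only if

\[ \bar{f}_{t+s} \ge f_{t-s-1} \, \frac{(v-t+s)!\,(t-s-1)!}{(v-t-s-1)!\,(t+s)!}. \tag{24} \]

Since $k+1 < v$, inequality (24) contradicts (21) and hence $h(B, \mathcal{F}') < h(B, \mathcal{F})$.
Therefore $h(B, \mathcal{F})$ is not minimal and the proposition is established in this case.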

When $k = 2t$ and $q = 0$, for every partition of $B$ into two $t$-sets, one of the
parts is in $\mathcal{F}$. Hence $\mathcal{F}$ contains at least $\frac{1}{2}\binom{2t}{t}$ $t$-sets. From an argument
similar to that above, it is readily shown that $h(B, \mathcal{F})$ is minimized when $\mathcal{F}$
contains every $(t+1)$-subset of $B$ but no $(t-1)$-subset. Hence the bound (19)
follows with strict inequality. If $k = 2t+q$, $q > 0$, then for any $x \in B$ the set
$\{ b \in \mathcal{F} : x \notin b \}$ is a $(2, q-1)$-cover of $B \setminus \{x\}$ and the proposition follows from
the case $k = 2t+q-1$.

Proof of Theorem 2 Let $x_t$ denote the RHS of (19). Then

\[ \frac{x_{t+1}}{x_t} = \frac{(2t+q+1)(2t+q)}{(v-t)(t+q)}, \tag{25} \]

which is less than one if and only if inequality (10) is not satisfied. Hence $x_t$ is
minimized at $t = t^*$. From Definition 1, if $B \in \mathcal{D}^*$ and $\mathcal{D}^* \in \sigma^v_{2,q}$ then any
chain which intersects the $(2, q)$-cover $\mathcal{F}$ formed by the private subsets of $B$
cannot intersect the private subsets of any other element of $\mathcal{D}^*$.
Therefore, invoking Lemma 1, $n = |\mathcal{D}^*|$ is bounded above by the inverse of
the minimum value of $h(B, \mathcal{F})$ over all $(2, q)$-covers $\mathcal{F}$ of $B$, which in turn is
bounded above by $1/x_{t^*}$.
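The ratio identity and the location of the minimum can be checked numerically. The sketch below assumes that $x_t = \binom{2t+q-1}{t} / \binom{v}{t}$, that the ratio $x_{t+1}/x_t$ equals $(2t+q+1)(2t+q) / ((v-t)(t+q))$, and that $t^*$ is the least $t \ge 1$ with $v \le 5t + 2 + q(q-1)/(t+q)$. The function names are hypothetical, and exact fractions are used so that ties are detected exactly.

```python
from fractions import Fraction
from math import comb


def x(t, v, q):
    """x_t, read here as C(2t+q-1, t) / C(v, t)."""
    return Fraction(comb(2 * t + q - 1, t), comb(v, t))


def ratio(t, v, q):
    """Closed form assumed for x_{t+1} / x_t."""
    return Fraction((2 * t + q + 1) * (2 * t + q), (v - t) * (t + q))


def t_star(v, q):
    """Least integer t >= 1 satisfying the (non-strict) inequality for t*."""
    t = 1
    while v > 5 * t + 2 + Fraction(q * (q - 1), t + q):
        t += 1
    return t


v, q = 22, 1
ts = range(1, (v - q + 1) // 2 + 1)
identity_holds = all(x(t + 1, v, q) / x(t, v, q) == ratio(t, v, q) for t in ts)
best = min(ts, key=lambda t: x(t, v, q))  # first minimiser when there is a tie
print(identity_holds, best, t_star(v, q))
```

When $v$ equals the right-hand side of the inequality exactly, the ratio is one and $x_{t^*} = x_{t^*+1}$, so the minimum is attained at two consecutive values of $t$; the sketch reports the smaller one.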